A commentary on

Audio-visual onset differences are used to 
determine syllable identity for ambiguous 
audio-visual stimulus pairs 

by Ten Oever, S., Sack, A. T., Wheat,
K. L., Bien, N., and van Atteveldt, 
N. (2013). Front. Psychol. 4:331. doi: 
10.3389/fpsyg.2013.00331 

The amount of research focused on mul- 
tisensory speech perception has expanded 
considerably in recent years. Much of 
this research has focused on which fac- 
tors influence whether or not an auditory 
and a visual speech input are "integrated" 
(i.e., perceptually bound), a special case
of how our perceptual systems solve the 
"binding problem" (Treisman, 1996). The 
factors that have been identified as influ- 
encing multisensory integration can be 
roughly divided into two groups. First are 
the low-level stimulus factors that include 
the physical characteristics of the sensory 
signals. The most commonly studied of 
these include the spatial (e.g., Macaluso
et al., 2004; Wallace et al., 2004) and
temporal (e.g., Miller and D'Esposito, 2005;
Stevenson et al., 2011) relationship of the
two inputs, and their relative effectiveness
(e.g., James et al., 2012; Kim et al., 2012) in
driving a neural, perceptual, or behavioral 
response. The second group of factors can 
be considered higher-order or cognitive,
and include factors such as the semantic
congruence of the auditory and visual
signals (Laurienti et al., 2004) or whether 
or not the gender of the speaker's voice 
is matched to the face (Lachs and Pisoni, 
2004). 

While these two categories can be 
considered conceptually distinct, they are 
related because of their mutual depen- 
dence upon the natural statistics of sig- 
nals in the environment. When auditory 
and visual speech signals are closely prox- 
imate in time (low-level), they are more 
likely to have originated from the same 
speaker, and thus should be integrated 
(Dixon and Spitz, 1980; Stevenson et al., 
2012b). Likewise, if an auditory and a 
visual speech signal are semantically con- 
gruent (high-level), they are more likely 
to have originated from the same speaker 
and thus should be integrated (Calvert 
et al., 2000). Given that these low- and 
high-level factors are each reflective of the 
natural statistics of the environmental sig- 
nals, they will generally co-vary. Taking 
speech as an example, in a natural setting, 
the temporally-coincident auditory and 
visual components of a syllable or word 
are also semantically congruent (Spence, 
2007). 

To date, most research has investigated
these low- and high-level factors
independently. These studies have been 
highly informative, providing descriptions 
as to how each of these factors con- 
tributes to the process of multisensory 
integration. What has not received a great 
deal of focus is the interplay between 
these factors. A handful of experiments 
have investigated how low-level factors 
interact with one another and influence 
multisensory integration (Macaluso et al., 
2004; Royal et al., 2009; Stevenson et al.,
2012a), but few have attempted to bridge 
between low-level stimulus-characteristics 
and high-level cognitive factors (Vatakis 
and Spence, 2007). A recent article by 
Ten Oever et al. (2013), "Audio-visual onset
differences are used to determine syllable
identity for ambiguous audio-visual stimulus
pairs," addresses this gap in our
understanding by investigating the interaction
between stimulus timing and semantic 
congruency modulated by changes in place 
of articulation or voicing. 

FIGURE 1 | The left panel shows a "parallel
accumulator" model with auditory and visual
evidence racing toward threshold (γ). The
amount of visual influence on the auditory
signal is a function, f, of parameters δ and
θ, which represent temporal coincidence
detection and phonetic congruence,
respectively, both of which contribute
evidence to a single accumulator. The serial
model on the right shows two separate stages
where integration is affected first by
temporal, then by semantic processing. Hence,
in stage 1, visual information influences
auditory processing only as a function, f, of
temporal coincidence. In stage 2, visual
information influences auditory processing
solely as a function, g, of phonemic
compatibility.

In this study, participants were presented
with single-syllable stimuli, with auditory,
visual, and audiovisual syllables
systematically manipulated according to place
of articulation and voicing. In addition, the
temporal alignment of the audiovisual
presentations was parametrically varied.
Hence, semantic content was varied through
changes in both the auditory (voicing) and
visual (place of articulation) signals, while
the relative timing of the auditory and
visual stimuli was systematically varied.
While the results specific to these factors
are interesting on their own, most germane to
this commentary is how these two factors
interacted. The authors measured the window
of time within which the visual cue
influenced the syllable that was heard. This
probabilistic construct, referred to as the
"time window of integration" or the "temporal
binding window," has been shown to vary
greatly according to the type of stimulus
being integrated (Vatakis and Spence, 2006;
Stevenson and Wallace, 2013). In the Ten
Oever et al. study, semantically congruent
stimuli were found to be associated with a
wider temporal binding window than
semantically incongruent stimuli. That is,
stimulus components that are semantically
matched have higher rates of integration at
more temporally disparate offsets.
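
To make the notion of a temporal binding window concrete, the sketch below shows one way its width could be read off a curve of integration rates across audiovisual onset asynchronies. It is only an illustration: the Gaussian profile, the 50% criterion, the spread values, and the names `integration_rate` and `window_width` are assumptions made here, not the analysis used by Ten Oever et al. (2013), and the profiles are synthetic rather than measured.

```python
import math


def integration_rate(soa_ms, sigma_ms, peak=1.0):
    """Hypothetical Gaussian profile: proportion of trials on which the
    visual cue changes the reported syllable at a given onset asynchrony."""
    return peak * math.exp(-0.5 * (soa_ms / sigma_ms) ** 2)


def window_width(soas_ms, rates, criterion=0.5):
    """Width (ms) of the asynchrony range where rate >= criterion.

    Assumes a single-peaked profile sampled at increasing asynchronies;
    the two criterion crossings are found by linear interpolation.
    """
    above = [i for i, r in enumerate(rates) if r >= criterion]
    if not above:
        return 0.0
    first, last = above[0], above[-1]

    def crossing(i, j):
        # Asynchrony at which the line through samples i and j hits criterion.
        x0, x1, r0, r1 = soas_ms[i], soas_ms[j], rates[i], rates[j]
        return x0 + (criterion - r0) * (x1 - x0) / (r1 - r0)

    left = crossing(first - 1, first) if first > 0 else soas_ms[first]
    right = crossing(last, last + 1) if last < len(rates) - 1 else soas_ms[last]
    return right - left


# Synthetic profiles with assumed spreads (not data from the study).
soas = list(range(-500, 501, 50))
wide = [integration_rate(s, sigma_ms=200.0) for s in soas]
narrow = [integration_rate(s, sigma_ms=120.0) for s in soas]

print(window_width(soas, wide), window_width(soas, narrow))
```

Under these assumptions, a profile that stays above criterion at larger asynchronies yields a larger width, which is the sense in which one stimulus class can be said to have a "wider" window than another.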

The result is surprising in that it runs 
counter to predictions generated by hier- 
archical serial models. In such models, 
lower-level properties such as stimulus 
timing are processed initially, and are 
then followed by the processing of the 
linguistic (i.e., semantic) content in the 
auditory and visual signals. However, the 
current results, by illustrating an
interaction between timing and congruency,
suggest that hierarchical models are insuf- 
ficient to explain the data. Rather, we 
posit that these results are better inter- 
preted within a "parallel accumulation of 
evidence" framework (Figure 1). In this 
model, the temporal relationship of two 
sensory inputs provides important infor- 
mation about the likelihood that those two 
inputs originated from the same speaker 
and should be integrated. In addition, 
the semantic congruence of these inputs 
also provides information as to whether 
or not the two sensory inputs should be 
bound. Importantly, these two types of 
evidence are pooled into a single decision 
criterion. Thus, within such a framework, 
when stimuli are semantically congruent, 
a decreased amount of temporal align- 
ment is needed in order to cross a deci- 
sion bound that would result in these 
two inputs being integrated, manifesting 
in a broader temporal binding window for 
semantically congruent speech stimulus 
pairs. 
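
The contrast between the two accounts can be sketched in a few lines of code. The example below is a deliberately simplified, deterministic stand-in for the models in Figure 1, not an implementation of them: the Gaussian temporal evidence, the equal pooling weights, the bounds, and the semantic evidence values are all hypothetical choices made here for illustration. It pools the two kinds of evidence against a single bound in the parallel case and gates on each separately in the serial case.

```python
import math

SIGMA_MS = 150.0  # hypothetical spread of the temporal-coincidence evidence


def temporal_evidence(soa_ms):
    """Evidence for a common source from timing alone (1.0 at synchrony)."""
    return math.exp(-0.5 * (soa_ms / SIGMA_MS) ** 2)


def parallel_integrates(soa_ms, semantic_evidence, bound=0.6):
    """Pool temporal and semantic evidence and compare to a single bound."""
    pooled = 0.5 * temporal_evidence(soa_ms) + 0.5 * semantic_evidence
    return pooled >= bound


def serial_integrates(soa_ms, semantic_evidence,
                      temporal_bound=0.5, semantic_bound=0.5):
    """Stage 1 gates on timing alone; stage 2 gates on congruence alone."""
    if temporal_evidence(soa_ms) < temporal_bound:
        return False
    return semantic_evidence >= semantic_bound


def window_width(integrates, semantic_evidence, soas_ms):
    """Span (ms) of onset asynchronies at which the model integrates."""
    accepted = [s for s in soas_ms if integrates(s, semantic_evidence)]
    return max(accepted) - min(accepted) if accepted else 0


SOAS_MS = range(-400, 401, 10)
CONGRUENT, INCONGRUENT = 1.0, 0.6  # hypothetical semantic evidence values

for name, model in [("parallel", parallel_integrates),
                    ("serial", serial_integrates)]:
    print(name,
          window_width(model, CONGRUENT, SOAS_MS),
          window_width(model, INCONGRUENT, SOAS_MS))
```

With these assumed values, the pooled model tolerates larger asynchronies for the congruent pair than for the incongruent pair, because stronger semantic evidence lowers the temporal evidence needed to reach the bound. In the serial sketch the temporal stage never sees the semantic evidence, so any pair that passes the second stage has the same window.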

Through this interaction between stim- 
ulus timing and semantic congruence, 
Ten Oever and colleagues provided com- 
pelling evidence that low-level stimulus 
and high-level cognitive factors are not 
processed in a completely serial manner, 
but rather interact with one another in 
the formation of a perceptual decision. 
These results have significant implications
for our understanding of the neurobiological
substrates involved in real-world
multisensory perceptual processes. Most 
importantly, the work suggests that sig- 
nificant feedforward and feedback circuits 
are engaged in the processing of natural- 
istic multisensory stimuli, and that these 
circuits work in a parallel and cooper- 
ative fashion in evaluating the statistical 
relations of the stimuli to one another 
on both their low-level (i.e., stimulus fea- 
ture) and high-level (i.e., learned seman- 
tic) correspondences. 